========================================================
This report explores a dataset containing wine quality data for 4898 white wines.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
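The output above looks like what base R's `names()`, `str()` and `summary()` print. A minimal sketch of those calls follows. Since the report does not show how the data was loaded, the data frame `wq_head` is a hypothetical stand-in built from the first five rows printed by `str()`, restricted to four columns.

```r
# Hypothetical stand-in: first five rows of four columns,
# copied from the str() output above (not the full dataset).
wq_head <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality        = c(6L, 6L, 6L, 6L, 6L)
)

names(wq_head)    # column names
str(wq_head)      # type and first values of each column
summary(wq_head)  # min, quartiles, median, mean and max per column
```

On the full data frame the same three calls would produce the 13-variable listings shown above.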
Our dataset consists of 13 variables with 4898 observations; the first variable, X, is just a row index. First, I will look at the distributions of some of the variables through plots and summary tables.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Sulphates, chlorides, residual sugar and citric acid are all skewed to the right: in each case the mean is above the median, and the maximum lies far above the third quartile.
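One quick numeric check for right skew is whether the mean exceeds the median. The sketch below applies that rule of thumb to the values printed in the four summaries above. The helper name `is_right_skewed` is my own, and the rule is only a heuristic, not a formal skewness test.

```r
# Median and mean for each variable, copied from the summary output above.
stats <- data.frame(
  variable = c("sulphates", "chlorides", "residual.sugar", "citric.acid"),
  median   = c(0.4700, 0.04300, 5.200, 0.3200),
  mean     = c(0.4898, 0.04577, 6.391, 0.3342)
)

# Hypothetical helper: a mean above the median hints at a right (positive) skew.
is_right_skewed <- function(mean_value, median_value) mean_value > median_value

stats$right_skewed <- is_right_skewed(stats$mean, stats$median)
print(stats)
```

A histogram is still the better tool for judging the shape of the tail; this only confirms the direction of the skew.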
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
pH is approximately normally distributed with no apparent skew (mean 3.188, median 3.180).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
With a reduced bin width, the distribution of wine density looks approximately normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Alcohol does not follow a clearly recognizable distribution on either a linear or a logarithmic scale.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
Quality is approximately normally distributed. Quality 6 has the most wines while quality 9 has the fewest. The lowest quality is 3 and the highest is 9.
There are 4898 wines in the dataset with 12 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality), not counting the row index X. Other observations:
The most common quality is 6.
The median density of the wines is 0.9937.
The median alcohol content (10.40) is less than the mean alcohol content (10.51).
The main features of interest in the dataset are pH, density, alcohol and quality. I'd like to see how the other features influence these four.
I expect the chemical components (chlorides, sulphates, citric.acid, free.sulfur.dioxide, total.sulfur.dioxide, fixed.acidity, volatile.acidity, residual.sugar) to directly affect the density and pH of the wines. In turn, I expect pH and density to affect the alcohol content and the quality of the wine.
I did not create any new variables in the dataset.
I did not transform any of the data, since none of the distributions looked unusual enough to require it.
First, I look at the correlation matrix and table of the variables to establish relationships between the variables.
## X fixed.acidity volatile.acidity
## X 1.000000000 -0.25581431 0.002857966
## fixed.acidity -0.255814305 1.00000000 -0.022697290
## volatile.acidity 0.002857966 -0.02269729 1.000000000
## citric.acid -0.149899918 0.28918070 -0.149471811
## residual.sugar 0.006623775 0.08902070 0.064286060
## chlorides -0.045645192 0.02308564 0.070511571
## free.sulfur.dioxide -0.011928911 -0.04939586 -0.097011939
## total.sulfur.dioxide -0.161979037 0.09106976 0.089260504
## density -0.185976097 0.26533101 0.027113845
## pH -0.115774132 -0.42585829 -0.031915368
## sulphates 0.009807759 -0.01714299 -0.035728147
## alcohol 0.213656245 -0.12088112 0.067717943
## quality 0.035763247 -0.11366283 -0.194722969
## citric.acid residual.sugar chlorides
## X -0.149899918 0.006623775 -0.04564519
## fixed.acidity 0.289180698 0.089020701 0.02308564
## volatile.acidity -0.149471811 0.064286060 0.07051157
## citric.acid 1.000000000 0.094211624 0.11436445
## residual.sugar 0.094211624 1.000000000 0.08868454
## chlorides 0.114364448 0.088684536 1.00000000
## free.sulfur.dioxide 0.094077221 0.299098354 0.10139235
## total.sulfur.dioxide 0.121130798 0.401439311 0.19891030
## density 0.149502571 0.838966455 0.25721132
## pH -0.163748211 -0.194133454 -0.09043946
## sulphates 0.062330940 -0.026664366 0.01676288
## alcohol -0.075728730 -0.450631222 -0.36018871
## quality -0.009209091 -0.097576829 -0.20993441
## free.sulfur.dioxide total.sulfur.dioxide density
## X -0.0119289106 -0.161979037 -0.18597610
## fixed.acidity -0.0493958591 0.091069756 0.26533101
## volatile.acidity -0.0970119393 0.089260504 0.02711385
## citric.acid 0.0940772210 0.121130798 0.14950257
## residual.sugar 0.2990983537 0.401439311 0.83896645
## chlorides 0.1013923521 0.198910300 0.25721132
## free.sulfur.dioxide 1.0000000000 0.615500965 0.29421041
## total.sulfur.dioxide 0.6155009650 1.000000000 0.52988132
## density 0.2942104109 0.529881324 1.00000000
## pH -0.0006177961 0.002320972 -0.09359149
## sulphates 0.0592172458 0.134562367 0.07449315
## alcohol -0.2501039415 -0.448892102 -0.78013762
## quality 0.0081580671 -0.174737218 -0.30712331
## pH sulphates alcohol quality
## X -0.1157741316 0.009807759 0.21365624 0.035763247
## fixed.acidity -0.4258582910 -0.017142985 -0.12088112 -0.113662831
## volatile.acidity -0.0319153683 -0.035728147 0.06771794 -0.194722969
## citric.acid -0.1637482114 0.062330940 -0.07572873 -0.009209091
## residual.sugar -0.1941334540 -0.026664366 -0.45063122 -0.097576829
## chlorides -0.0904394560 0.016762884 -0.36018871 -0.209934411
## free.sulfur.dioxide -0.0006177961 0.059217246 -0.25010394 0.008158067
## total.sulfur.dioxide 0.0023209718 0.134562367 -0.44889210 -0.174737218
## density -0.0935914935 0.074493149 -0.78013762 -0.307123313
## pH 1.0000000000 0.155951497 0.12143210 0.099427246
## sulphates 0.1559514973 1.000000000 -0.01743277 0.053677877
## alcohol 0.1214320987 -0.017432772 1.00000000 0.435574715
## quality 0.0994272457 0.053677877 0.43557472 1.000000000
From the plot, the following pairs have the strongest correlations: alcohol vs density (r = -0.78), density vs residual.sugar (r = 0.84), density vs total.sulfur.dioxide (r = 0.53) and quality vs alcohol (r = 0.44).
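Ranking pairs by absolute correlation can also be done directly on the matrix rather than by eye. The sketch below does this for a four-variable subset, with the values copied from the matrix above and rounded to three decimals. The names `cor_mat` and `pair_table` are my own; on the full data the matrix would come from `cor()` instead of being typed in.

```r
# Correlations copied (rounded to three decimals) from the matrix above.
vars <- c("residual.sugar", "density", "alcohol", "quality")
cor_mat <- matrix(
  c( 1.000,  0.839, -0.451, -0.098,
     0.839,  1.000, -0.780, -0.307,
    -0.451, -0.780,  1.000,  0.436,
    -0.098, -0.307,  0.436,  1.000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Keep each pair once (upper triangle), then sort by absolute correlation.
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
pair_table <- pair_table[order(-abs(pair_table$r)), ]
print(pair_table)
```

Sorting on the absolute value puts strong negative pairs such as alcohol vs density next to strong positive ones, which is easy to miss when scanning a printed matrix.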
Let’s explore various relationships that alcohol has with other variables.
Fixed acidity and residual sugar are only very weakly correlated with each other (r = 0.09).
Residual sugar has a strong positive correlation with the density of the wines (r = 0.84).
Citric acid and volatile acidity are only weakly correlated with each other (r = -0.15).
A strong positive correlation exists between free sulfur dioxide and total sulfur dioxide (r = 0.62).
Fixed acidity and volatile acidity are essentially uncorrelated (r = -0.02).
There is a strong negative correlation between the alcohol content and the density of the wines (r = -0.78). Of all the variables, alcohol content also has the strongest relationship with the quality of the wine (r = 0.44). Wines with a quality of 9 have a smaller range in alcohol content than wines in the other quality levels and also have the highest median alcohol content. It is, however, interesting to note that the highest alcohol content occurs at quality 7. pH does not seem to have any strong correlation with the alcohol content of the wines (r = 0.12), which is contrary to what I expected.
Fixed acidity and volatile acidity do not appear to be correlated with each other, which is counter to what I expected. The total sulfur dioxide is strongly positively correlated with the free sulfur dioxide in the wines.
The strongest positive relationship is between density and residual sugar of the wines while the strongest negative relationship is between alcohol and density. Looking at the plot, we do have one extreme outlier that could be influencing the overall relationship.
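One way to judge how much a single extreme point influences a correlation is to compute it with and without that point. The sketch below uses toy numbers invented purely for illustration (they are not taken from the wine data) to show the mechanics; the variable names are my own.

```r
# Toy data, invented for illustration only (not from the wine dataset):
# eight points on a perfect negative line plus one extreme point.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 20)
y <- c(8, 7, 6, 5, 4, 3, 2, 1, 5)

r_all <- cor(x, y)                                 # Pearson correlation, outlier included
is_outlier <- x == max(x)                          # flag the single extreme point
r_trimmed <- cor(x[!is_outlier], y[!is_outlier])   # correlation without it

cat("with outlier:   ", round(r_all, 2), "\n")
cat("without outlier:", round(r_trimmed, 2), "\n")
```

With nine points a single outlier changes the coefficient a great deal; with 4898 wines the effect of one point should be far smaller, so repeating the comparison on the real data is the only way to know.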
I want to see how acidity in general affects the alcohol content of the wine. I will compare fixed.acidity, volatile.acidity, citric.acid, pH and alcohol.
The two types of acidity (fixed and volatile) are only weakly correlated with the alcohol content of the wines (r = -0.12 and r = 0.07). On the plot, pH falls between the two acid types, so it looks like a rough midpoint of the two.
Next, I make some plots to investigate how density, alcohol and pH relate to each other.
The strongest relationship found in the correlation matrix was between residual.sugar and density (r = 0.84). Alcohol is negatively correlated with residual.sugar (r = -0.45), total.sulfur.dioxide (r = -0.45) and, most strongly, density (r = -0.78). Let's fit a linear model of alcohol on these three variables:
##
## Call:
## lm(formula = alcohol ~ residual.sugar + total.sulfur.dioxide +
## density, data = wqdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9956 -0.4163 -0.0501 0.3534 16.2984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.626e+02 5.808e+00 96.860 <2e-16 ***
## residual.sugar 1.668e-01 3.207e-03 51.998 <2e-16 ***
## total.sulfur.dioxide -2.368e-04 2.456e-04 -0.964 0.335
## density -5.565e+02 5.873e+00 -94.745 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6167 on 4894 degrees of freedom
## Multiple R-squared: 0.749, Adjusted R-squared: 0.7489
## F-statistic: 4869 on 3 and 4894 DF, p-value: < 2.2e-16
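To make the fitted equation concrete, the sketch below plugs the coefficients from the output above into a small helper and evaluates it for the first two wines listed by `str()`. The helper name `predict_alcohol` is my own. Because the printed coefficients and density values are rounded, the results are only approximate; on the real fitted model object, `predict()` would be the usual route.

```r
# Coefficients copied from the summary output above (four significant digits).
coefs <- c(intercept            = 5.626e+02,
           residual.sugar       = 1.668e-01,
           total.sulfur.dioxide = -2.368e-04,
           density              = -5.565e+02)

# Hypothetical helper: evaluate the fitted linear equation for one or more wines.
predict_alcohol <- function(residual.sugar, total.sulfur.dioxide, density) {
  unname(coefs["intercept"] +
           coefs["residual.sugar"] * residual.sugar +
           coefs["total.sulfur.dioxide"] * total.sulfur.dioxide +
           coefs["density"] * density)
}

# First two wines from the str() output; their observed alcohol is 8.8 and 9.5.
predicted <- predict_alcohol(residual.sugar       = c(20.7, 1.6),
                             total.sulfur.dioxide = c(170, 132),
                             density              = c(1.001, 0.994))
print(round(predicted, 2))
```

The density coefficient is large because density varies over a very narrow range (0.9871 to 1.0390), so small changes in density correspond to large changes in predicted alcohol.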
Wines with a quality of 9 appear to have a smaller range in alcohol/pH and also have the highest median of alcohol/pH.
Wines that have a quality of 9 have the smallest range in alcohol/pH and also the highest median alcohol/pH of all the wines.

### Were there any interesting or surprising interactions between features?

The factors that influence the pH do not appear to have any correlation with the level of alcohol in the wines. Only citric.acid seems to have any correlation with both pH and the level of alcohol in the wines.

### OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I made a linear model to estimate the alcohol level of the wines using the residual sugar, total sulfur dioxide and density. Together, the variables account for 74.9% of the variability in the alcohol content of the wines (R-squared = 0.749), although total sulfur dioxide is not a significant predictor in this model (p = 0.335).
The density of the wines is negatively correlated with their alcohol content. There is an interesting outlier that has a high density but only a slightly above-average alcohol content. Most of the points are tightly clustered, so this outlier does not affect the average that much.
In the density plot of alcohol by quality, wines with a quality of 9 show two prominent peaks as alcohol increases and are otherwise close to zero, while the other quality levels have a more gradual distribution.
Wines with a quality of 9 have the highest median alcohol/pH concentration but also appear to have the smallest range in alcohol/pH.
In this analysis, I looked at a dataset on white wine quality. This dataset comes from work done by Cortez et al., 2009. There are 4898 observations in this dataset and 13 variables. I started by understanding the individual variables in the dataset and then explored the interesting relationships between these variables by making plots. Eventually I made a linear model to estimate the alcohol content of the wine using variables that were highly correlated with it.
There was a clear negative correlation between the quality of the wine and its density (r = -0.31). I was surprised that the fixed acidity and the volatile acidity did not appear to be correlated with each other. I struggled to make sense of this, since I expected the two acidity measures to be related.
Some limitations of this model come from the source data. It doesn't account for the seasonality of wine or the regions where the grapes were grown. Adding these variables would have made the dataset more robust. Future work should include these variables in the analysis.